Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: fix fileExistsOnFirstNode check #105514

Merged
merged 1 commit into from
Jun 27, 2023

Conversation

healthy-pod
Copy link
Contributor

For some reason, the current form of fileExistsOnFirstNode can return found=true when it should return found=false.

This can be reproduced by running the multitenant-upgrade roachtest and seeing it hang at:

multitenant_upgrade.go:154: test status: checking the pre-upgrade sql server still works after the system tenant binary upgrade

because of a TLS handshake error.

Note: the test is also broken because of another issue so with this fix it should now fail with:

(assertions.go:333).Fail:
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:390
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:189
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:38
	            				main/pkg/cmd/roachtest/test_runner.go:1060
	            				GOROOT/src/runtime/asm_arm64.s:1172
	Error:      	Not equal:
	            	expected: [][]string{[]string{"23.1"}}
	            	actual  : [][]string{[]string{"22.2"}}

	            	Diff:
	            	--- Expected
	            	+++ Actual
	            	@@ -2,3 +2,3 @@
	            	  ([]string) (len=1) {
	            	-  (string) (len=4) "23.1"
	            	+  (string) (len=4) "22.2"
	            	  }
	Test:       	multitenant-upgrade

Release note: None
Epic: none

@healthy-pod healthy-pod requested a review from herkolategan June 24, 2023 20:52
@healthy-pod healthy-pod requested a review from a team as a code owner June 24, 2023 20:52
@healthy-pod healthy-pod requested review from smg260 and removed request for a team June 24, 2023 20:52
@cockroach-teamcity
Copy link
Member

This change is Reviewable

For some reason, the current form of `fileExistsOnFirstNode`
can return `found=true` when it should return `found=false`.

This can be reproduced by running the `multitenant-upgrade`
roachtest and seeing it hang at:
```
multitenant_upgrade.go:154: test status: checking the pre-upgrade sql server still works after the system tenant binary upgrade
```
because of a `TLS handshake error`.

Note: the test is also broken because of another issue
so with this fix it should now fail with:
```
(assertions.go:333).Fail:
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:390
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:189
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go:38
	            				main/pkg/cmd/roachtest/test_runner.go:1060
	            				GOROOT/src/runtime/asm_arm64.s:1172
	Error:      	Not equal:
	            	expected: [][]string{[]string{"23.1"}}
	            	actual  : [][]string{[]string{"22.2"}}

	            	Diff:
	            	--- Expected
	            	+++ Actual
	            	@@ -2,3 +2,3 @@
	            	  ([]string) (len=1) {
	            	-  (string) (len=4) "23.1"
	            	+  (string) (len=4) "22.2"
	            	  }
	Test:       	multitenant-upgrade
```

Release note: None
Epic: none
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, the current form of fileExistsOnFirstNode can return found=true when it should return found=false.

The reason for that is $? which was substituted by bash prior to executing the commands passed via -c. It needed to be escaped; e.g.,

bash -c "test -e '/foobar'; echo $?"
0
bash -c "test -e '/foobar'; echo \$?"
1

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This possibly caused the failure on #105176 as well.
I'm in favour of resolving this as soon as possible to avoid any other unnecessary failures.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I exposed the fileExistsOnFirstNode function on the roachprod cli to put this through a few scenarios. The original $(test -e ...); echo $? command seems to work, even without the escape (there's an ssh escape down the line that might be the reason for this).

The flaw in the previous revision is actually much more sinister. There's a \n in Stdout that makes the rudimentary result.Stdout == "0" insufficient. It will always state the file as not being present.

The fix to the previous revision is then obvious: strings.TrimSpace(result.Stdout) == "0"
But in hindsight, perhaps it's not best to rely on the output, unless the matching is improved and not error prone.
Checking for other unexpected error codes is also a nice addition.

Thanks for the fix! I'm for reverting back to a command failure that does not log to make this more robust, unless anyone has any objections.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: If the check always states 'not found' as true, then it does make me wonder about the details of the test failure if the certs are always distributed and how that causes a problem?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @smg260)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix for this was also included in the first commit of #105454 (using strings.TrimSpace). I kind of prefer the simplicity of that as the echo $? command would only equal "0" if the command succeeded and the file exists. Roachprod errors (err != nil) are returned to the caller as expected.

That said, I'm fine with this change as well and with merging it sooner. Thanks for fixing!

if (result.RemoteExitStatus != 0 && result.RemoteExitStatus != 1) || err != nil {
// Unexpected exit status (neither 0 nor 1) or non-nil error. Return combined output along with err returned
// from the call if it's not nil.
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic could be simplified by lifting the shared error (running '%s' ...) and calling Wrapf if err != nil

testCmd := `$(test -e ` + path + `);`
// Do not log output to stdout/stderr because in some cases this call will be expected to exit 1.
result, err := c.runCmdOnSingleNode(ctx, l, c.Nodes[0], testCmd, true, nil, nil)
if (result.RemoteExitStatus != 0 && result.RemoteExitStatus != 1) || err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does result.Err play a role here? Shouldn't we check for it too? (cc @smg260)

@renatolabs
Copy link
Contributor

If the check always states 'not found' as true

I think it's the reverse -- the function would return 'found' as false, causing new certificates to be generated each time you start node 1.

Copy link
Contributor

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @renatolabs)


pkg/roachprod/install/cluster_synced.go line 1537 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

How does result.Err play a role here? Shouldn't we check for it too? (cc @smg260)

Yes. When running commands over ssh, via runCmdOnSIngleNode, res.Err is more relevant, since we don't expect the standalone err to non-nil very often. Since we're checking for 1 or 0 to covers an acceptable res.Err (file not existing), we should prob do something like (roughly):

res,err := runOnSingleNode(..)
if err != nil return false, err // always return the non command error immediately
// special cases for `test -e`
if res.exitStatus == 0 return true, nil
if res.exitStatus == 1 return false, nil
else
// any other exit code we map to an error for the caller
return false, res.Err

@healthy-pod
Copy link
Contributor Author

TFTRs!
I will merge this as is to fix the bug and will leave it up to @renatolabs and test-eng to decide how to move forward with this in Renato's commit.

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 27, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants